Quant Assignment 5

Rania Karamallah, MC Abbott, Alex Cardelle

11/19/2021

Research Question

Is there a relationship between mode of transportation to travel to work and different health indicators? How do other demographic characteristics (eg. race and age) factor in, if at all?

Prior Research

Riggs and Sethi (2020), in examining a quantified, indexed version of “walkability”, have determined that neighborhoods scoring high in this category do show increased multimodal/alternative transportation use (namely – walking, cycling, or transit). With more specificity, Watson et. al (2020) provide a national-level overview that confirms a strong link between high walkability scores and walking itself as a mode of commuting to work throughout the urban areas of the United States. Our research question seeks to confirm these findings through selected census and walkability study data.

Data

We will be using the following data sets:

  • United States American Community Survey (2019)
  • Centers for Disease Control and Prevention: PLACES U.S. Chronic Disease Indicators (2020)
  • United States Environmental Protection Agency: National Walkability Index (2021)

Variables for Analysis

Continuous * Population Percentage with Current Mental Health Issues * Population Percentage with Obesity * Population Percentage with Current Asthma * Percentage of Mode of Transportation to Work * Walkability Index Score

Categorical * Majority White vs. Non-White * Majority Female vs. Not Female

Sample Definition

Our research will focus on the scale of census tracts in Massachusetts, with a look at adult characteristics and behaviors aggregated to the level of census tracts.

Load data

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidycensus)
library(readxl)
library(knitr)
library(sf)
## Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ggthemes)
library(ggspatial)
library(jtools)
library(tigris)
## To enable 
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
library(huxtable)
## 
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
## 
##     add_rownames
## The following object is masked from 'package:ggplot2':
## 
##     theme_grey

Mode of Transportation

Source: American Community Survey (2019)

  • Mode of Transportation
acs_transport <- get_acs(geography = "tract", 
                        year = 2019,
                        variables = c(transport_any = "B08301_001E",
                                      transport_walk = "B08301_019E",
                                      transport_car = "B08301_002E",
                                      transport_public = "B08301_010E",
                                      transport_bike = "B08301_018E",
                                      transport_taxi = "B08301_016E",
                                      transport_motorcycle = "B08301_017E",
                                      transport_other = "B08301_020E"),
                        state = "MA",
                        output = "wide",
                        geometry = FALSE) %>%
  mutate(pct_walk = transport_walk / transport_any) %>%
  mutate(pct_car = transport_car / transport_any) %>%
  mutate(pct_public = transport_public / transport_any) %>%
  mutate(pct_bike = transport_bike / transport_any) %>%
  mutate(pct_taxi = transport_taxi / transport_any) %>%
  mutate(pct_motorcycle = transport_motorcycle / transport_any) %>%
  mutate(pct_other = transport_other / transport_any)%>%
  select(GEOID, pct_walk, pct_car, pct_public, pct_bike, pct_taxi, pct_motorcycle, pct_other)

Demographics

Source: American Community Survey (2019)

  • Majority White vs. Non-White (categorical)
  • Majority Female vs. Not Female (categorical)
race_majority <- get_acs(geography = "tract", 
                        year = 2019,
                        variables = c( total_pop = "B02001_001",
                                       white_pop = "B02001_002"),
                        state = "MA",
                        output = "wide",
                        geometry = FALSE) %>%
  mutate(white_majority = case_when(white_popE / total_popE >= .5 ~ "Majority White",
                          white_popE / total_popE < .5 ~ "Non-Majority White",
                          TRUE ~ "unknown")) %>%
  select(GEOID,white_majority)
## Getting data from the 2015-2019 5-year ACS
sex_majority <- get_acs(geography = "tract", 
                        year = 2019,
                        variables = c( total_pop = "B01001_001",
                                       female_pop = "B01001_026"),
                        state = "MA",
                        output = "wide",
                        geometry = FALSE) %>%
  mutate(female_majority = case_when(female_popE / total_popE >= .5 ~ "Female Majority",
                          female_popE / total_popE < .5 ~ "Non-Female Majority",
                          TRUE ~ "unknown")) %>%
  select(GEOID,female_majority)
## Getting data from the 2015-2019 5-year ACS

U.S. Chronic Disease Indicators

Source: United States Centers for Disease Control and Prevention: PLACES U.S. Chronic Disease Indicators

  • Population Percentage with Current Mental Health Issues (continuous)
  • Population Percentage with Obesity (continuous)
  • Population Percentage with Current Asthma (continuous)
Mental_Health <- read_csv('data/PLACES__Local_Data_for_Better_Health__Census_Tract_Data_2020_release.csv') %>%
  filter(Year == 2018) %>%
  filter(Short_Question_Text == "Mental Health") %>%
  filter(StateAbbr == "MA") %>%
  rename(GEOID = LocationID) %>%
  group_by(GEOID) %>%
  rename(pct_mental = Data_Value) %>%
  select(GEOID, pct_mental, Data_Value_Unit)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 2024865 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): StateAbbr, StateDesc, CountyName, CountyFIPS, LocationName, DataSo...
## dbl  (5): Year, Data_Value, Low_Confidence_Limit, High_Confidence_Limit, Tot...
## lgl  (2): Data_Value_Footnote_Symbol, Data_Value_Footnote
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Asthma <- read_csv('data/PLACES__Local_Data_for_Better_Health__Census_Tract_Data_2020_release.csv') %>%
  filter(Year == 2018) %>%
  filter(Short_Question_Text == "Current Asthma") %>%
  filter(StateAbbr == "MA") %>%
  rename(GEOID = LocationID) %>%
  group_by(GEOID) %>%
  rename(pct_asthma = Data_Value) %>%
  select(GEOID, pct_asthma, Data_Value_Unit)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 2024865 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): StateAbbr, StateDesc, CountyName, CountyFIPS, LocationName, DataSo...
## dbl  (5): Year, Data_Value, Low_Confidence_Limit, High_Confidence_Limit, Tot...
## lgl  (2): Data_Value_Footnote_Symbol, Data_Value_Footnote
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Obesity <- read_csv('data/PLACES__Local_Data_for_Better_Health__Census_Tract_Data_2020_release.csv') %>%
  filter(Year == 2018) %>%
  filter(Short_Question_Text == "Obesity") %>%
  filter(StateAbbr == "MA") %>%
  rename(GEOID = LocationID) %>%
  group_by(GEOID) %>%
  rename(pct_obesity = Data_Value) %>%
  select(GEOID, pct_obesity, Data_Value_Unit)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 2024865 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): StateAbbr, StateDesc, CountyName, CountyFIPS, LocationName, DataSo...
## dbl  (5): Year, Data_Value, Low_Confidence_Limit, High_Confidence_Limit, Tot...
## lgl  (2): Data_Value_Footnote_Symbol, Data_Value_Footnote
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Walkability

Source: United States Environmental Protection Agency: National Walkability Index

  • Walkability Index Score
# The following code was used to created the original "Walkability" file, it was then re-loaded as a .shp file
# This was done to reduce the file size and because the .shp file doesn't except "avg_w_ind" as a column name (too long apparently).

# Walkability <- st_read(dsn = "Natl_WI.gdb") %>%
#   filter(STATEFP == "25") %>%
#   select(GEOID10, NatWalkInd, TotPop) %>%
#   mutate(tract = substr(GEOID10, 1, 11)) %>%
#   st_set_geometry(NULL) %>%
#   group_by(tract) %>%
#   summarise(avg_w_ind = weighted.mean(NatWalkInd, TotPop)) %>%
#   rename(GEOID = tract)
# 
# st_write(Walkability, "data/Walkability_file.shp")
# print(Walkability)

Walkability <- st_read(dsn = "data/Walkability_file.dbf")
## Reading layer `Walkability_file' from data source 
##   `/Users/mc/Documents/2_ SCHOOL/2_ Harvard/1_ Classes/S1/6_ Quant/Assignment 5/quant_assignment_1/data/Walkability_file.dbf' 
##   using driver `ESRI Shapefile'
## Warning: no simple feature geometries present: returning a data.frame or tbl_df

Assembling the datasets

data <- left_join(acs_transport, Walkability) %>% 
  left_join(Asthma) %>%
  left_join(Obesity) %>%
  left_join(Mental_Health) %>%
  left_join(race_majority) %>%
  left_join(sex_majority) %>%
  select(GEOID, avg_w_ind, pct_walk, pct_car, pct_public, pct_bike, pct_taxi, pct_motorcycle, pct_other, pct_asthma, pct_mental, pct_obesity, white_majority, female_majority)

kable(head(data))
GEOID avg_w_ind pct_walk pct_car pct_public pct_bike pct_taxi pct_motorcycle pct_other pct_asthma pct_mental pct_obesity white_majority female_majority
25027731001 11.333333 0.0052493 0.9755031 0.0131234 0.0000000 0 0 0.0000000 10.1 14.0 30.8 Majority White Female Majority
25027709501 7.333333 0.0000000 0.9022624 0.0199095 0.0000000 0 0 0.0000000 9.7 12.6 30.7 Majority White Female Majority
25027709502 9.525729 0.0102145 0.9417773 0.0108955 0.0000000 0 0 0.0000000 10.0 12.7 30.0 Majority White Female Majority
25027721101 3.833333 0.0078125 0.9101562 0.0029297 0.0029297 0 0 0.0146484 10.0 13.5 31.0 Majority White Female Majority
25027721102 5.537969 0.0000000 0.9310201 0.0000000 0.0000000 0 0 0.0000000 9.6 12.4 29.6 Majority White Non-Female Majority
25027731002 12.806539 0.0000000 0.9193947 0.0648548 0.0000000 0 0 0.0000000 11.0 16.1 35.4 Majority White Female Majority

Initial Analysis of Continuous Data

asthma_t_test <- t.test(data$pct_asthma)
obesity_t_test <- t.test(data$pct_obesity)
mental_t_test <- t.test(data$pct_mental)
walkability_t_test <- t.test(data$avg_w_ind)
mwalk_t_test <- t.test(data$pct_walk)
mcar_t_test <- t.test(data$pct_car)
mpublic_t_test <- t.test(data$pct_public)
mbike_t_test <- t.test(data$pct_bike)
mtaxi_t_test <- t.test(data$pct_taxi)
mmotorcycle_t_test <- t.test(data$pct_motorcycle)
mother_t_test <- t.test(data$pct_other)
asthma_quartiles <- quantile(data$pct_asthma, na.rm = TRUE)
obesity_quartiles <- quantile(data$pct_obesity, na.rm = TRUE)
mental_quartiles <- quantile(data$pct_mental, na.rm = TRUE)

walk_quartiles <- quantile(data$pct_walk, na.rm = TRUE)
bike_quartiles <- quantile(data$pct_bike, na.rm = TRUE)
car_quartiles <- quantile(data$pct_car, na.rm = TRUE)
taxi_quartiles <- quantile(data$pct_taxi, na.rm = TRUE)
public_quartiles <- quantile(data$pct_public, na.rm = TRUE)
motorcycle_quartiles <- quantile(data$pct_motorcycle, na.rm = TRUE)
other_quartiles <- quantile(data$pct_other, na.rm = TRUE)
asthma_st_dev <- sd(data$pct_asthma, na.rm = TRUE)
obesity_st_dev <- sd(data$pct_obesity, na.rm = TRUE)
mental_st_dev <- sd(data$pct_mental, na.rm = TRUE)

walk_st_dev <- sd(data$pct_walk, na.rm = TRUE)
bike_st_dev <- sd(data$pct_bike, na.rm = TRUE)
car_st_dev <- sd(data$pct_car, na.rm = TRUE)
taxi_st_dev <- sd(data$pct_taxi, na.rm = TRUE)
public_st_dev <- sd(data$pct_public, na.rm = TRUE)
motorcycle_st_dev <- sd(data$pct_motorcycle, na.rm = TRUE)
other_st_dev <- sd(data$pct_other, na.rm = TRUE)
asthma_hist <- ggplot(data) +
  geom_histogram(aes(x = pct_asthma),
                 bins = 30)

obesity_hist <- ggplot(data) +
  geom_histogram(aes(x = pct_obesity),
                 bins = 30) +
  scale_x_continuous(trans = "log")

mental_hist <- ggplot(data) +
  geom_histogram(aes(x = pct_mental),
                 bins = 30)

asthma_hist
## Warning: Removed 15 rows containing non-finite values (stat_bin).

mode_summary <- tibble(
  Variable = c("Percentage of Workers Who Commute by Walking", 
               "Percentage of Workers Who Commute by Bike", 
               "Percentage of Workers Who Commute by Car",
               "Percentage of Workers Who Commute by Public Transit",
               "Percentage of Workers Who Commute by Taxi",
               "Percentage of Workers Who Commute by Motorcycle",
               "Percentage of Workers Who Commute by Other Means"),
  `Sample mean` = c((mwalk_t_test$estimate*100),
                    (mbike_t_test$estimate*100),
                    (mcar_t_test$estimate*100),
                    (mpublic_t_test$estimate*100),
                    (mtaxi_t_test$estimate*100),
                    (mmotorcycle_t_test$estimate*100),
                    (mother_t_test$estimate*100)),
  
  `Population mean (95% confidence) - low` = 
    c((mwalk_t_test$conf.int[1]*100),
      (mbike_t_test$conf.int[1]*100),
      (mcar_t_test$conf.int[1]*100),
      (mpublic_t_test$conf.int[1]*100),
      (mtaxi_t_test$conf.int[1]*100),
      (mmotorcycle_t_test$conf.int[1]*100),
      (mother_t_test$conf.int[1]*100)),
  
  `Population mean (95% confidence) - high` =
    c((mwalk_t_test$conf.int[2]*100),
      (mbike_t_test$conf.int[2]*100),
      (mcar_t_test$conf.int[2]*100),
      (mpublic_t_test$conf.int[2]*100),
      (mtaxi_t_test$conf.int[2]*100),
      (mmotorcycle_t_test$conf.int[2]*100),
      (mother_t_test$conf.int[2])*100),
  
  Median = c((mwalk_t_test$conf.int[3]*100),
      (mbike_t_test$conf.int[3]*100),
      (mcar_t_test$conf.int[3]*100),
      (mpublic_t_test$conf.int[3]*100),
      (mtaxi_t_test$conf.int[3]*100),
      (mmotorcycle_t_test$conf.int[3]*100),
      (mother_t_test$conf.int[3]*100)),
  
  `Interquartile range` = c(((walk_quartiles[4] - walk_quartiles[2])*100),
                            ((bike_quartiles[4] - bike_quartiles[2])*100),
                            ((car_quartiles[4] - car_quartiles[2])*100),
                            ((public_quartiles[4] - public_quartiles[2])*100),
                            ((taxi_quartiles[4] - taxi_quartiles[2])*100),
                            ((motorcycle_quartiles[4] - motorcycle_quartiles[2])*100),
                            ((other_quartiles[4] - other_quartiles[2])*100)),
  `Standard deviation` = c((walk_st_dev*100), (bike_st_dev*100), (car_st_dev*100), (taxi_st_dev*100), (public_st_dev*100), (motorcycle_st_dev*100), (other_st_dev*100)))

kable(mode_summary, digits = 2)
Variable Sample mean Population mean (95% confidence) - low Population mean (95% confidence) - high Median Interquartile range Standard deviation
Percentage of Workers Who Commute by Walking 5.58 5.07 6.09 NA 4.63 10.02
Percentage of Workers Who Commute by Bike 0.89 0.78 0.99 NA 0.72 2.04
Percentage of Workers Who Commute by Car 76.42 75.39 77.45 NA 20.37 20.05
Percentage of Workers Who Commute by Public Transit 10.50 9.83 11.18 NA 12.95 0.69
Percentage of Workers Who Commute by Taxi 0.29 0.25 0.32 NA 0.08 13.17
Percentage of Workers Who Commute by Motorcycle 0.08 0.06 0.10 NA 0.00 0.41
Percentage of Workers Who Commute by Other Means 1.03 0.95 1.12 NA 1.42 1.58
pretty_asthma_hist <- asthma_hist +
  theme_bw() +
  scale_x_continuous(name = "Percentage of residents with asthma") +
  scale_y_continuous(name = "Number of tracts") +
  theme(axis.text.x = element_text(angle = 90))

pretty_obesity_hist <- obesity_hist +
  theme_bw() +
  scale_x_continuous(name = "Percentage of residents with obesity") +
  scale_y_continuous(name = "Number of tracts") +
  theme(axis.text.x = element_text(angle = 90))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
pretty_mental_hist <- mental_hist +
  theme_bw() +
  scale_x_continuous(name = "Percentage of residents with poor mental health") +
  scale_y_continuous(name = "Number of tracts") +
  theme(axis.text.x = element_text(angle = 90))

grid.arrange(pretty_asthma_hist, pretty_obesity_hist, pretty_mental_hist,
             ncol = 3)
## Warning: Removed 15 rows containing non-finite values (stat_bin).

## Warning: Removed 15 rows containing non-finite values (stat_bin).

## Warning: Removed 15 rows containing non-finite values (stat_bin).

Initial Analysis of Categorical Data

pct_white <- t.test(data$white_majority == "Majority White")
pct_nonwhite <-  t.test(data$white_majority == "Non-Majority White")

pct_female <- t.test(data$female_majority == "Female Majority")
pct_notfemale <-  t.test(data$female_majority == "Non-Female Majority")
cat_summary_race <- tibble(`Majority Race` = 
                              c("White",
                                "Not White"),
                            `Sample proportion` = 
                              c(pct_white$estimate * 100,
                                pct_nonwhite$estimate *100),
                            `95-percent confidence - low` = 
                              c(pct_white$conf.int[1] * 100,
                                pct_nonwhite$conf.int[1] * 100),
                            `95-percent confidence - high` = 
                              c(pct_white$conf.int[2] * 100,
                                pct_nonwhite$conf.int[2] * 100))

kable(cat_summary_race, digits = 2)
Majority Race Sample proportion 95-percent confidence - low 95-percent confidence - high
White 89.51 87.95 91.08
Not White 9.54 8.04 11.04
cat_summary_sex <- tibble(`Majority Sex` = 
                              c("Female",
                                "Not Female"),
                            `Sample proportion` = 
                              c(pct_female$estimate * 100,
                                pct_notfemale$estimate *100),
                            `95-percent confidence - low` = 
                              c(pct_female$conf.int[1] * 100,
                                pct_notfemale$conf.int[1] * 100),
                            `95-percent confidence - high` = 
                              c(pct_female$conf.int[2] * 100,
                                pct_notfemale$conf.int[2] * 100))

kable(cat_summary_sex, digits = 2)
Majority Sex Sample proportion 95-percent confidence - low 95-percent confidence - high
Female 69.08 66.72 71.44
Not Female 29.97 27.63 32.31

Assignment - Bivariate regression

In trying to understand the importance of walkability to health and commuting patterns, we ran a regression analysis for correlation using ‘cor.test’ and ‘age_model’, then plotted the data on a graph. Our dependent variable is “walkability” provided through a Massachusetts Census Tracts’ walkability score. Our independent variables are divided into two fields: health metrics and commuting choice. For health metrics, we are using the percent of the population afflicted with asthma, obesity, or mental health issues respectively. For commuting choice, we’re using the percent of the population that commutes via walking, biking, public transit, or car respectively.

Walkability and Health Metrics

Walkability and Asthma

cor.test(~ avg_w_ind + pct_asthma, data = data)
## 
##  Pearson's product-moment correlation
## 
## data:  avg_w_ind and pct_asthma
## t = 11.883, df = 1461, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2494104 0.3428982
## sample estimates:
##       cor 
## 0.2968655
age_model <- lm(avg_w_ind ~ pct_asthma, data = data)

summary(age_model)
## 
## Call:
## lm(formula = avg_w_ind ~ pct_asthma, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.8997  -3.1243   0.3704   2.9462   8.7244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.4426     0.7741   3.155  0.00164 ** 
## pct_asthma    0.8722     0.0734  11.883  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.723 on 1461 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.08813,    Adjusted R-squared:  0.0875 
## F-statistic: 141.2 on 1 and 1461 DF,  p-value: < 2.2e-16
ggplot(data) +
  geom_point(aes(x = avg_w_ind, y = pct_asthma)) +
  geom_smooth(aes(x = avg_w_ind, y = pct_asthma), color = 'red', method = 'lm', se = FALSE) +
  labs(x = "Walkability Score", 
       y = "Percent of Pop. with Asthma", 
       title = "Regression: Walkability Score and Asthma Rates, Correlation: 0.29")

For asthma, the 95-percent confidence interval for the correlation does not include zero - all values in the interval are positive. Additionally, this finding is supported by the regression, with an R-squared value of 0.08 and a p-value of less than 0.05. This means we can say with 95-percent confidence that higher asthma rates are associated with more walkable census tracts and that it is statistically significant.

Walkability and Obesity

cor.test(~ avg_w_ind + pct_obesity, data = data)
## 
##  Pearson's product-moment correlation
## 
## data:  avg_w_ind and pct_obesity
## t = -0.48809, df = 1461, p-value = 0.6256
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06397625  0.03850640
## sample estimates:
##         cor 
## -0.01276846
age_model <- lm(avg_w_ind ~ pct_obesity, data = data)

summary(age_model)
## 
## Call:
## lm(formula = avg_w_ind ~ pct_obesity, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8134 -3.5706  0.9706  3.2836  7.4787 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.798894   0.484225  24.367   <2e-16 ***
## pct_obesity -0.008676   0.017775  -0.488    0.626    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.899 on 1461 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.000163,   Adjusted R-squared:  -0.0005213 
## F-statistic: 0.2382 on 1 and 1461 DF,  p-value: 0.6256
ggplot(data) +
  geom_point(aes(x = avg_w_ind, y = pct_obesity)) +
  geom_smooth(aes(x = avg_w_ind, y = pct_obesity), color = 'red', method = 'lm', se = FALSE) +
  labs(x = "Walkability Score", 
       y = "Percent of Pop. with Obesity", 
       title = "Regression: Walkability Score and Obesity Rates, Correlation: -0.01")

For obesity, the 95-percent confidence interval for the correlation includes zero, meaning the direction of the correlation (positive or negative) is uncertain. Additionally, this finding is supported by the regression, with an R-squared value of -0.00052 and a p-value of 0.625. This means we cannot inform with with confidence whether obesity and walkable census tracts are correlated and that this finding’s significance is indeterminable with the present evidence.

Walkability and Mental Health

cor.test(~ avg_w_ind + pct_mental, data = data)
## 
##  Pearson's product-moment correlation
## 
## data:  avg_w_ind and pct_mental
## t = 13.737, df = 1461, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2920163 0.3828188
## sample estimates:
##       cor 
## 0.3382045
age_model <- lm(avg_w_ind ~ pct_mental, data = data)

summary(age_model)
## 
## Call:
## lm(formula = avg_w_ind ~ pct_mental, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7242  -2.9927   0.3731   2.9256   8.3049 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.08786    0.41030   14.84   <2e-16 ***
## pct_mental   0.39927    0.02907   13.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.669 on 1461 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.1144, Adjusted R-squared:  0.1138 
## F-statistic: 188.7 on 1 and 1461 DF,  p-value: < 2.2e-16
ggplot(data) +
  geom_point(aes(x = avg_w_ind, y = pct_mental)) +
  geom_smooth(aes(x = avg_w_ind, y = pct_mental), color = 'red', method = 'lm', se = FALSE) +
  labs(x = "Walkability Score", 
       y = "Percent of Pop. with Mental Health Issues", 
       title = "Regression: Walkability Score and Mental Health Rates, Correlation: 0.33")

For mental health, the 95-percent confidence interval for the correlation does not include zero - all values in the interval are positive. Additionally, this finding is supported by the regression, with an R-squared value of 0.11 and a p-value of less than 0.05. This means we can say with 95-percent confidence that higher incidence of mental health issues are associated with more walkable census tracts and that this finding is statistically significant.

Walkability and Commute Choice

Walkability and Commuting by Walking

cor.test(~ avg_w_ind + pct_walk, data = data)
## 
##  Pearson's product-moment correlation
## 
## data:  avg_w_ind and pct_walk
## t = 16.317, df = 1462, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3482749 0.4349887
## sample estimates:
##       cor 
## 0.3925036
age_model <- lm(avg_w_ind ~ pct_walk, data = data)

summary(age_model)
## 
## Call:
## lm(formula = avg_w_ind ~ pct_walk, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8213  -3.1892   0.6462   3.0897   7.3097 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.7073     0.1075   99.60   <2e-16 ***
## pct_walk     15.2938     0.9373   16.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.594 on 1462 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.1541, Adjusted R-squared:  0.1535 
## F-statistic: 266.3 on 1 and 1462 DF,  p-value: < 2.2e-16
ggplot(data) +
  geom_point(aes(x = avg_w_ind, y = pct_walk)) +
  geom_smooth(aes(x = avg_w_ind, y = pct_walk), color = 'red', method = 'lm', se = FALSE) +
  labs(x = "Walkability Score", 
       y = "Percent of Pop. Commuting Via Walking", 
       title = "Regression: Walkability Score and Walking Use, Correlation: 0.39")

For commuting by walking, the 95-percent confidence interval for the correlation does not include zero - all values in the interval are positive. Additionally, this finding is supported by the regression, with an R-squared value of 0.15 and a p-value of less than 0.05. This means we can say with 95-percent confidence that higher commuting rates by walking are associated with more walkable census tracts and that this finding is statistically significant.

Walkability and Commuting by Biking

cor.test(~ avg_w_ind + pct_bike, data = data)
## 
##  Pearson's product-moment correlation
## 
## data:  avg_w_ind and pct_bike
## t = 14.107, df = 1462, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3002402 0.3904558
## sample estimates:
##      cor 
## 0.346148
age_model <- lm(avg_w_ind ~ pct_bike, data = data)

summary(age_model)
## 
## Call:
## lm(formula = avg_w_ind ~ pct_bike, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9704 -3.2022  0.5954  3.0827  7.7655 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.9704     0.1045  104.94   <2e-16 ***
## pct_bike     66.4250     4.7085   14.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.666 on 1462 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.1198, Adjusted R-squared:  0.1192 
## F-statistic:   199 on 1 and 1462 DF,  p-value: < 2.2e-16
ggplot(data) +
  geom_point(aes(x = avg_w_ind, y = pct_bike)) +
  geom_smooth(aes(x = avg_w_ind, y = pct_bike), color = 'blue', method = 'lm', se = FALSE) +
  labs(x = "Walkability Score", 
       y = "Percent of Pop. Commuting Via Biking", 
       title = "Regression: Walkability Score and Biking Use, Correlation: 0.34")

For commuting by biking, the 95-percent confidence interval for the correlation does not include zero - all values in the interval are positive. Additionally, this finding is supported by the regression, with an R-squared value of 0.11 and a p-value of less than 0.05. This means we can say with 95-percent confidence that higher commuting rates by biking are associated with more walkable census tracts and that this finding is statistically significant.

Walkability and Commuting by Public Transit

cor.test(~ avg_w_ind + pct_public, data = data)
## 
##  Pearson's product-moment correlation
## 
## data:  avg_w_ind and pct_public
## t = 19.814, df = 1462, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4187422 0.4995602
## sample estimates:
##       cor 
## 0.4601037
age_model <- lm(avg_w_ind ~ pct_public, data = data)

summary(age_model)
## 
## Call:
## lm(formula = avg_w_ind ~ pct_public, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.7747  -2.8021   0.2858   2.7100   8.1044 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.1272     0.1160   87.31   <2e-16 ***
## pct_public   13.6476     0.6888   19.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.469 on 1462 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.2117, Adjusted R-squared:  0.2112 
## F-statistic: 392.6 on 1 and 1462 DF,  p-value: < 2.2e-16
ggplot(data) +
  geom_point(aes(x = avg_w_ind, y = pct_public)) +
  geom_smooth(aes(x = avg_w_ind, y = pct_public), color = 'green', method = 'lm', se = FALSE) +
  labs(x = "Walkability Score", 
       y = "Percent of Pop. Commuting Via Transit", 
       title = "Regression: Walkability Score and Transit Use, Correlation: 0.46")

For commuting by transit, the 95-percent confidence interval for the correlation does not include zero - all values in the interval are positive. Additionally, this finding is supported by the regression, with an R-squared value of 0.21 and a p-value of less than 0.05. This means we can say with 95-percent confidence that higher commuting rates by transit are associated with more walkable census tracts and that this finding is statistically significant.

Walkability and Commuting by Car

cor.test(~ avg_w_ind + pct_car, data = data)
## 
##  Pearson's product-moment correlation
## 
## data:  avg_w_ind and pct_car
## t = -23.1, df = 1462, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5536691 -0.4785508
## sample estimates:
##       cor 
## -0.517105
age_model <- lm(avg_w_ind ~ pct_car, data = data)

summary(age_model)
## 
## Call:
## lm(formula = avg_w_ind ~ pct_car, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.2611  -2.7227   0.1955   2.6107   8.0268 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.2611     0.3446   55.89   <2e-16 ***
## pct_car     -10.0765     0.4362  -23.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.344 on 1462 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.2674, Adjusted R-squared:  0.2669 
## F-statistic: 533.6 on 1 and 1462 DF,  p-value: < 2.2e-16
ggplot(data) +
  geom_point(aes(x = avg_w_ind, y = pct_car)) +
  geom_smooth(aes(x = avg_w_ind, y = pct_car), color = 'orange', method = 'lm', se = FALSE) +
  labs(x = "Walkability Score", 
       y = "Percent of Pop. Commuting Via Car", 
       title = "Regression: Walkability Score and Car Use, Correlation: -0.51")

For commuting by car, the 95-percent confidence interval for the correlation does not include zero - all values in the interval are negative. Additionally, this finding is supported by the regression, with an R-squared value of 0.26 and a p-value of less than 0.05. This means we can say with 95-percent confidence that lower commuting rates by car are associated with more walkable census tracts and that this finding is statistically significant.

Categorical variables

Sex_majority

sex_model <- lm(avg_w_ind ~ female_majority, data = data)

summary(sex_model)
## 
## Call:
## lm(formula = avg_w_ind ~ female_majority, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.402  -3.602   0.950   3.292   7.702 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         11.6292     0.1222   95.13   <2e-16 ***
## female_majorityNon-Female Majority  -0.2267     0.2222   -1.02    0.308    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.906 on 1462 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.0007113,  Adjusted R-squared:  2.781e-05 
## F-statistic: 1.041 on 1 and 1462 DF,  p-value: 0.3078
ggplot(data) +
  geom_point(aes(x = avg_w_ind, y = female_majority)) +
  geom_smooth(aes(x = avg_w_ind, y = female_majority), color = 'orange', method = 'lm', se = FALSE) +
  labs(x = "Walkability Score", 
       y = "Percent of Pop. that identify as female", 
       title = "Walkability Score by Sex")

Race_majority

race_model <- lm(avg_w_ind ~ white_majority, data = data)

summary(race_model)
## 
## Call:
## lm(formula = avg_w_ind ~ white_majority, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9151  -3.2729   0.3703   3.1949   7.8655 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       11.2386     0.1039 108.125   <2e-16 ***
## white_majorityNon-Majority White   3.3431     0.3349   9.981   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.781 on 1462 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.0638, Adjusted R-squared:  0.06316 
## F-statistic: 99.63 on 1 and 1462 DF,  p-value: < 2.2e-16
ggplot(data) +
  geom_point(aes(x = avg_w_ind, y = white_majority)) +
  geom_smooth(aes(x = avg_w_ind, y = white_majority), color = 'orange', method = 'lm', se = FALSE) +
  labs(x = "Walkability Score", 
       y = "Percent of Pop. that identify as White", 
       title = "Walkability Score by Race")

Although the sex_category has is not statistically significant, however, it shows that census tracts with a higher female_majority increase walkability score for the census tract.

We can also see that Race_majority is statistically significant, census tracts with higher non-majority white race increase the walkability score for the census tract.

Assignment 4 - Multivariate regression

full_model <- lm(avg_w_ind ~ pct_asthma + pct_obesity + pct_mental + pct_walk + pct_bike + pct_public + pct_car + pct_taxi + pct_motorcycle + pct_other + white_majority, data)

summary(full_model)
## 
## Call:
## lm(formula = avg_w_ind ~ pct_asthma + pct_obesity + pct_mental + 
##     pct_walk + pct_bike + pct_public + pct_car + pct_taxi + pct_motorcycle + 
##     pct_other + white_majority, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.7785  -2.3036  -0.0372   2.2156   7.9100 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -1.614270   2.211977  -0.730 0.465639    
## pct_asthma                        0.003785   0.161408   0.023 0.981297    
## pct_obesity                      -0.041913   0.025393  -1.651 0.099038 .  
## pct_mental                        0.332207   0.068165   4.874 1.22e-06 ***
## pct_walk                         13.791473   2.694690   5.118 3.50e-07 ***
## pct_bike                         38.529980   5.115190   7.532 8.71e-14 ***
## pct_public                       17.431580   2.233629   7.804 1.14e-14 ***
## pct_car                           8.096956   2.244250   3.608 0.000319 ***
## pct_taxi                         83.076529  12.046489   6.896 7.94e-12 ***
## pct_motorcycle                   22.946219  18.966155   1.210 0.226533    
## pct_other                        30.050899   5.836046   5.149 2.98e-07 ***
## white_majorityNon-Majority White  0.085221   0.319473   0.267 0.789696    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.947 on 1451 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.4328, Adjusted R-squared:  0.4285 
## F-statistic: 100.6 on 11 and 1451 DF,  p-value: < 2.2e-16

In our bivariate regression, walkability scores increases in census tracts that have higher numbers of the following medical conditions: Asthma and mental health issues, also increases with a higher percentage of the population who uses the following modes of transit: Walk, Bike, Public Transit. Walkability scores also increase with an increased population who identifies as non-majority white.

When we control for all medical conditions variables (asthma, obesity, mental issues) and all different modes of transportation variables (walk, bike, public transit, car), we found that the percentage of population that commutes by bikes contributes by 33.52615 positively to the walkability score of the census tract, making it the most significant category. Similarly, the population that uses public transit contributes to the increase of walkability in a census tract by 11.57760. Regarding medical conditions, populations with high mental issues contribute significantly to the walkability of the census tract by 0.476.

Although the female_majorty variant was not significant as a bivariant, combined with other variants in the multivariant regression, it increases in significance. Non_female majority census tracts decrease the walkability score -0.63827

This explains that census tracts with access to public transit, walking routes, and bike paths, with population, that with majority non-white race, female majority, that report a higher number of mental issues and asthma are the most significant variants to increase walkability scores.

Overall, our model explains about 40 percent of the variation in walkability-level can be explained with these variables.

Assignment 5 - Transformations

centered_data <- data %>%
  mutate (avg_w_ind = avg_w_ind - mean(avg_w_ind, na.rm=TRUE),
         pct_walk = pct_walk - mean(pct_walk, na.rm=TRUE),
         pct_car = pct_car - mean(pct_car, na.rm=TRUE),
         pct_public = pct_public - mean(pct_public, na.rm=TRUE),
         pct_bike = pct_bike - mean(pct_bike, na.rm=TRUE),
         pct_taxi = pct_taxi - mean(pct_taxi, na.rm=TRUE),
         pct_motorcycle = pct_motorcycle - mean(pct_motorcycle, na.rm=TRUE),
         pct_other = pct_other - mean(pct_other, na.rm=TRUE),
         pct_asthma = pct_asthma - mean(pct_asthma, na.rm=TRUE),
         pct_mental = pct_mental - mean(pct_mental, na.rm=TRUE),
         pct_obesity = pct_obesity - mean(pct_obesity, na.rm=TRUE))

centered_model <- lm(avg_w_ind ~ pct_walk + pct_car + pct_public + pct_bike + pct_taxi + pct_motorcycle + pct_other + pct_asthma + pct_mental + pct_obesity + white_majority, centered_data)

export_summs(full_model, centered_model, 
             error_format = "(p = {p.value})",
             error_pos = "same",
             model.names = c("Initial", "Centered"))
InitialCentered
(Intercept)-1.61 (p = 0.47)0.00 (p = 0.96)
pct_asthma0.00 (p = 0.98)0.00 (p = 0.98)
pct_obesity-0.04 (p = 0.10)-0.04 (p = 0.10)
pct_mental0.33 *** (p = 0.00)0.33 *** (p = 0.00)
pct_walk13.79 *** (p = 0.00)13.79 *** (p = 0.00)
pct_bike38.53 *** (p = 0.00)38.53 *** (p = 0.00)
pct_public17.43 *** (p = 0.00)17.43 *** (p = 0.00)
pct_car8.10 *** (p = 0.00)8.10 *** (p = 0.00)
pct_taxi83.08 *** (p = 0.00)83.08 *** (p = 0.00)
pct_motorcycle22.95 (p = 0.23)22.95 (p = 0.23)
pct_other30.05 *** (p = 0.00)30.05 *** (p = 0.00)
white_majorityNon-Majority White0.09 (p = 0.79)0.09 (p = 0.79)
N1463    1463    
R20.43 0.43 
*** p < 0.001; ** p < 0.01; * p < 0.05.

Analysis

In our multi-variate regression, we noticed that our intercept was a negative number. We found this interesting for interpretation because the regression table as originally estimated tells us that if all other variables (means of commute, certain health indicators, and ) are held at zero, we would expect a negative Walkability Index Score. Given it’s not possible to have a negative Walkability Index score, we thought it would be valuable to mean center our continuous variables in order to see how the intercept value might shift if all variables were end at their mean value.

What we found in the process of mean centering our data did very little to improve the overall analysis of our regression table. With a p-value of .96, the mean-centered intercept was statistically insignificant. Even if it had produced a significant p-value, the estimated walkability score of 0 does not offer any new insight as once again this isn’t a possible score on the Walkability Index (1-20). As expected, all other estimated coefficient values stayed constant despite the mean-centering. Ultimately, we prefer this second model, as it shows that mean-centering has been attempted (as opposed to not addressing the negative intercept value in the non-mean centered multi-variate regression) but we acknowledge this does not offer any new insight.

Resources

Centers for Disease Control and Prevention, “PLACES: Local Data for Better Health, Census Tract Data 2020 release”. 2020. https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh

United States Census Bureau. American Community Survey, 5-year estimates. 2019.

Watson et al., “Associations between the National Walkability Index and walking among US Adults - National Health Interview Survey, 2015”. 2020. https://pubmed.ncbi.nlm.nih.gov/32389677/

Riggs, William Warren and Suresh Andrew Sethi, “Multimodal travel behaviour, walkability indices, and social mobility: how neighbourhood walkability, income and household characteristics guide walking, biking & transit decisions”. 2020. https://www.researchgate.net/publication/344647424_Multimodal_travel_behaviour_walkability_indices_and_social_mobility_how_neighbourhood_walkability_income_and_household_characteristics_guide_walking_biking_transit_decisions